Smart Paper Notes

Subspace clustering in high-dimensions: Phase transitions & Statistical-to-Computational gap

Luca Pesce , Bruno Loureiro , Florent Krzakala , Lenka Zdeborová

Categories: (Statistical) Machine Learning | Machine Learning

2022-05-26

A simple model to study subspace clustering is the high-dimensional $k$-Gaussian mixture model where the cluster means are sparse vectors. Here we provide an exact asymptotic characterization of the statistically optimal reconstruction error in this model in the high-dimensional regime with extensive sparsity, i.e. when the fraction $\rho$ of non-zero components of the cluster means, as well as the ratio $\alpha$ between the number of samples and the dimension, are fixed while the dimension diverges. We identify the information-theoretic threshold below which obtaining a positive correlation with the true cluster means is statistically impossible. Additionally, we investigate the performance of the approximate message passing (AMP) algorithm, analyzed via its state evolution, which is conjectured to be optimal among polynomial-time algorithms for this task. In particular, we identify a statistical-to-computational gap between the algorithmic threshold, where such algorithms require a signal-to-noise ratio $\lambda_{\text{alg}} \ge k / \sqrt{\alpha}$ to perform better than random, and the information-theoretic threshold at $\lambda_{\text{it}} \approx \sqrt{-k \rho \log{\rho}} / \sqrt{\alpha}$. Finally, we discuss the case of sub-extensive sparsity $\rho$ by comparing the performance of AMP with other sparsity-enhancing algorithms, such as sparse PCA and diagonal thresholding.
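
To make the size of the gap concrete, the two threshold expressions quoted in the abstract can be evaluated directly. The following is a minimal sketch, not code from the paper: the function names are hypothetical, the logarithm is assumed to be the natural logarithm, and the information-theoretic expression is only the approximate form given in the abstract, which does not state the regime in which the approximation holds.

```python
import math


def algorithmic_threshold(k: int, alpha: float) -> float:
    """Algorithmic threshold quoted in the abstract: lambda_alg = k / sqrt(alpha)."""
    if k <= 0 or alpha <= 0:
        raise ValueError("k and alpha must be positive")
    return k / math.sqrt(alpha)


def information_theoretic_threshold(k: int, rho: float, alpha: float) -> float:
    """Approximate expression from the abstract:
    lambda_it ~ sqrt(-k * rho * log(rho)) / sqrt(alpha).

    Assumption: log is the natural logarithm and 0 < rho < 1,
    so that -rho * log(rho) is positive.
    """
    if k <= 0 or alpha <= 0:
        raise ValueError("k and alpha must be positive")
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    return math.sqrt(-k * rho * math.log(rho)) / math.sqrt(alpha)


def gap_ratio(k: int, rho: float) -> float:
    """Ratio lambda_alg / lambda_it; alpha cancels because both scale as 1/sqrt(alpha)."""
    return algorithmic_threshold(k, 1.0) / information_theoretic_threshold(k, rho, 1.0)


if __name__ == "__main__":
    # Illustrative parameter values only; they are not taken from the paper.
    k, rho, alpha = 2, 0.01, 1.0
    print("lambda_alg:", algorithmic_threshold(k, alpha))
    print("lambda_it :", information_theoretic_threshold(k, rho, alpha))
    print("ratio     :", gap_ratio(k, rho))
```

Because both expressions scale as $1/\sqrt{\alpha}$, their ratio depends only on $k$ and $\rho$, which is why `gap_ratio` takes no `alpha` argument.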

Error Rates for Kernel Classification under Source and Capacity Conditions

Hugo Cui , Bruno Loureiro , Florent Krzakala , Lenka Zdeborová

Categories: (Statistical) Machine Learning | Machine Learning

2022-01-29

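We consider the problem of kernel classification. Work on kernel regression has shown that, for a large class of datasets, the rate at which the prediction error decays with the number of samples is well characterized by two quantities: the capacity and the source of the dataset. In this work, we compute the decay rates of the misclassification (prediction) error under the Gaussian design, for datasets satisfying source and capacity assumptions. We derive the rates as a function of the source and capacity coefficients for two standard kernel classification settings, namely margin-maximizing support vector machines (SVM) and ridge classification, and contrast the two methods. As a consequence, we find that the known worst-case rates are loose for this class of datasets. Finally, we show that the rates presented in this work are also observed on real datasets.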

Learning Gaussian Mixtures with Generalised Linear Models: Precise Asymptotics in High-dimensions

Bruno Loureiro , Gabriele Sicuro , Cédric Gerbelot , Alessandro Pacco , Florent Krzakala , Lenka Zdeborová

Categories: (Statistical) Machine Learning | Machine Learning

2021-06-07

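Generalised linear models for multi-class classification problems are one of the fundamental building blocks of modern machine learning tasks. In this manuscript, we characterise the learning of a mixture of $k$ Gaussians with generic means and covariances via empirical risk minimisation (ERM) with any convex loss and regularisation. In particular, we prove exact asymptotics characterising the ERM estimator in high dimensions, extending several previous results on Gaussian mixture classification in the literature. We exemplify our result in two tasks of interest in statistical learning: a) classification for a mixture with sparse means, where we study the efficiency of the $\ell_1$ penalty with respect to $\ell_2$; b) max-margin multi-class classification, where we characterise the phase transition on the existence of the multi-class logistic maximum likelihood estimator for $k > 2$. Finally, we discuss how our theory can be applied beyond the scope of synthetic data, showing that in different cases Gaussian mixtures closely capture the learning curves of classification tasks on real datasets.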

Generalization Error Rates in Kernel Regression: The Crossover from the Noiseless to Noisy Regime

Hugo Cui , Bruno Loureiro , Florent Krzakala , Lenka Zdeborová

Categories: (Statistical) Machine Learning | Machine Learning

2021-05-31

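In this manuscript, we consider kernel ridge regression (KRR) under the Gaussian design. Exponents for the decay of the excess generalization error of KRR have been reported in various works under the assumption of a power-law decay of the eigenvalues of the feature covariance. These decays were, however, provided for markedly different setups, namely the noiseless case with constant regularization and the noisy, optimally regularized case. Intermediate settings have been left substantially uncharted. In this work, we unify and extend this line of work, providing a characterization of all regimes and excess error decay rates that can be observed in terms of the interplay of noise and regularization. In particular, we show the existence of a transition in the noisy setting from the noiseless exponents to their noisy values as the sample complexity is increased. Finally, we illustrate how this crossover can also be observed on real datasets.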

Bayesian reconstruction of memories stored in neural networks from their connectivity

Sebastian Goldt , Florent Krzakala , Lenka Zdeborová , Nicolas Brunel

Categories: (Statistical) Machine Learning

2021-05-16

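The advent of comprehensive synaptic wiring diagrams of large neural circuits has created the field of connectomics and given rise to a number of open research questions. One such question is whether it is possible to reconstruct the information stored in a network of neurons, given its synaptic connectivity matrix. Here, we address this question by determining when solving such an inference problem is possible in specific attractor network models, and by providing a practical algorithm to do so. The algorithm builds on ideas from statistical physics to perform approximate Bayesian inference and is amenable to exact analysis. We study its performance on three different models, compare the algorithm to standard algorithms such as PCA, and explore the limitations of reconstructing stored patterns from synaptic connectivity.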

Learning curves of generic features maps for realistic datasets with a teacher-student model

Bruno Loureiro , Cédric Gerbelot , Hugo Cui , Sebastian Goldt , Florent Krzakala , Marc Mézard , Lenka Zdeborová

Categories: (Statistical) Machine Learning | Machine Learning

2021-02-16

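Teacher-student models provide a framework in which the typical-case performance of high-dimensional supervised learning can be described in closed form. The assumption of Gaussian i.i.d. input data underlying the canonical teacher-student model may, however, be perceived as too restrictive to capture the behaviour of realistic datasets. In this paper, we introduce a Gaussian covariate generalisation of the model in which the teacher and student can act on different spaces, generated with fixed but generic feature maps. While still solvable in closed form, this generalisation is able to capture the learning curves of a broad range of realistic datasets, thus redeeming the potential of the teacher-student framework. Our contribution is two-fold. First, we prove a rigorous formula for the asymptotic training loss and generalisation error. Second, we present a number of situations where the learning curve of the model captures that of a realistic dataset learned with kernel regression and classification, with out-of-the-box feature maps such as random projections or scattering transforms, or with pre-learned ones, such as the features learned by training multi-layer neural networks. We discuss both the power and the limitations of the framework.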

Dynamical mean-field theory for stochastic gradient descent in Gaussian mixture classification

Francesca Mignacco , Florent Krzakala , Pierfrancesco Urbani , Lenka Zdeborová

Categories: Machine Learning | (Statistical) Machine Learning

2020-06-10

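We analyze in closed form the learning dynamics of stochastic gradient descent (SGD) for a single-layer neural network classifying a high-dimensional Gaussian mixture in which each cluster is assigned one of two labels. This problem provides a prototype of a non-convex loss landscape with interpolating regimes and a large generalization gap. We define a particular stochastic process for which SGD can be extended to a continuous-time limit that we call stochastic gradient flow. In the full-batch limit, we recover the standard gradient flow. We apply dynamical mean-field theory from statistical physics to track the dynamics of the algorithm in the high-dimensional limit via a self-consistent stochastic process. We explore the performance of the algorithm as a function of the control parameters, shedding light on how it navigates the loss landscape.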
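
The setting described above can be illustrated with a small simulation. The sketch below is not the paper's method: it runs plain mini-batch SGD with a logistic loss on the simplest two-cluster instance (clusters at $+\mu$ and $-\mu$, one label each), whereas the paper studies a specific stochastic process and its continuous-time limit analytically. All function names, the choice of loss, and the parameter values are assumptions made for illustration.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def make_mixture(n, d, mu_norm, rng):
    """Sample n points from two unit-variance Gaussian clusters centred at +mu and -mu.

    The label of a point is the sign of its cluster (assumed two-cluster setup).
    """
    mu = [mu_norm / math.sqrt(d)] * d  # Euclidean norm of mu equals mu_norm
    xs, ys = [], []
    for _ in range(n):
        y = rng.choice((-1.0, 1.0))
        xs.append([y * m + rng.gauss(0.0, 1.0) for m in mu])
        ys.append(y)
    return xs, ys


def sigmoid_of_negative(margin):
    """Numerically stable evaluation of 1 / (1 + exp(margin))."""
    if margin >= 0:
        e = math.exp(-margin)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(margin))


def sgd_logistic(xs, ys, batch_size, lr, steps, rng):
    """Mini-batch SGD on the logistic loss log(1 + exp(-y * w.x)), starting from w = 0.

    Setting batch_size = len(xs) gives full-batch gradient descent.
    """
    n, d = len(xs), len(xs[0])
    w = [0.0] * d
    for _ in range(steps):
        batch = rng.sample(range(n), batch_size)
        grad = [0.0] * d
        for i in batch:
            # d/dw log(1 + exp(-y w.x)) = -y * x / (1 + exp(y w.x))
            coeff = -ys[i] * sigmoid_of_negative(ys[i] * dot(w, xs[i]))
            for j in range(d):
                grad[j] += coeff * xs[i][j] / batch_size
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def accuracy(w, xs, ys):
    """Fraction of points whose label matches the sign of w.x."""
    return sum(1 for x, y in zip(xs, ys) if y * dot(w, x) > 0) / len(xs)


if __name__ == "__main__":
    rng = random.Random(0)
    xs, ys = make_mixture(n=200, d=20, mu_norm=3.0, rng=rng)
    w = sgd_logistic(xs, ys, batch_size=10, lr=0.1, steps=200, rng=rng)
    print("training accuracy:", accuracy(w, xs, ys))
```

Varying `batch_size` and `lr` in this toy gives a hands-on feel for the control parameters mentioned in the abstract, though it says nothing about the high-dimensional limit the paper analyzes.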

Tree-AMP: Compositional Inference with Tree Approximate Message Passing

Antoine Baker , Benjamin Aubin , Florent Krzakala , Lenka Zdeborová

Categories: (Statistical) Machine Learning | Machine Learning

2020-04-03

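We introduce Tree-AMP, standing for Tree Approximate Message Passing, a Python package for compositional inference in high-dimensional tree-structured models. The package provides a unifying framework to study several approximate message passing algorithms previously derived for a variety of machine learning tasks, such as generalized linear models, inference in multi-layer networks, matrix factorization, and reconstruction using non-separable penalties. For some models, the asymptotic performance of the algorithm can be theoretically predicted by the state evolution, and the entropy of the measurements can be estimated by the free entropy formalism. The implementation is modular by design: each module, which implements a factor, can be composed with other modules to solve complex inference tasks. The user only needs to declare the factor graph of the model; the inference algorithm, state evolution, and entropy estimation are fully automated.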

Unsigned Play by Milan Kundera? An Authorship Attribution Study

Lenka Jungmannová , Petr Plecháč

Categories: Natural Language Processing

2022-12-19

In addition to being a widely recognised novelist, Milan Kundera has also authored three pieces for theatre: The Owners of the Keys (Majitelé klíčů, 1961), The Blunder (Ptákovina, 1967), and Jacques and his Master (Jakub a jeho pán, 1971). In recent years, however, the hypothesis has been raised that Kundera is the true author of a fourth play: Juro Jánošík, first performed in a 1974 production under the name of Karel Steigerwald, who was Kundera's student at the time. In this study, we make use of supervised machine learning to settle the question of authorship attribution in the case of Juro Jánošík, with results strongly supporting the hypothesis of Kundera's authorship.

Regressing Relative Fine-Grained Change for Sub-Groups in Unreliable Heterogeneous Data Through Deep Multi-Task Metric Learning

Niall O' Mahony , Sean Campbell , Lenka Krpalkova , Joseph Walsh , Daniel Riordan

Categories: Machine Learning | Artificial Intelligence

2022-08-11

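Fine-grained change detection and regression analysis are essential in many applications of artificial intelligence. In practice, this task is often challenging owing to the lack of reliable ground-truth information and to complexity. It is therefore critical to develop a framework that can represent the relatedness and reliability of multiple sources of information. In this paper, we investigate how techniques from multi-task metric learning can be applied to the regression of fine-grained change in real data. The key idea is that if we incorporate the incremental change in a metric of interest between specific instances of an individual object as one of the tasks in a multi-task metric learning framework, then interpreting that dimension will allow the user to be alerted to fine-grained change that is invariant to the overall metric. The techniques investigated are specifically tailored to handling heterogeneous data sources: the input data for each task may contain missing values, the scale and resolution of the values are not consistent across tasks, and the data contain non-independent and identically distributed (non-IID) instances. We present the results of our initial experimental implementations of this idea and discuss related research in this domain, which may offer direction for further research.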
